Stream large transaction log jsons instead of storing in-memory #24491
@raunaqmorarka very interesting! I've skimmed the code. Would you consider this an alternative approach to the metadata/protocol caching we were trying to push in #20437, or more of a separate improvement? I assume they'll impact the same sort of time-consuming operations. (Asking as I'm keen on getting our fork closer to upstream; it doesn't really matter which approach ends up being used as long as performance improves 👍 )
    
fd27902 to 05a9faa
  
    
          
Thanks for pointing out that PR, I hadn't looked at it before. For me the priority was to deal gracefully with transaction log jsons that are GBs in size. I've tweaked the PR a bit to be better about caching metadata/protocol entries. Feel free to try this out on your workloads or add review comments.
    
Looks promising to me.
        
          
cc4cc3a to d714365
  
            
          
Operations fetching metadata and protocol entries can skip reading the rest of the json file after those entries are found
d714365 to 0301c34
  
    
Description
Operations that fetch metadata and protocol entries can skip reading the rest of the JSON file once those entries are found.
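The early-exit idea can be illustrated with a minimal sketch. This is not the code in this PR: the class `MetadataProtocolScan`, the record `Found`, and the method `scan` are hypothetical names, and the prefix check on each line is a simplifying assumption standing in for a real JSON parser. It relies only on the fact that a Delta transaction log JSON file holds one action per line, keyed by names such as `metaData`, `protocol`, and `add`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Optional;

public class MetadataProtocolScan
{
    // Hypothetical result holder for this sketch; not Trino's MetadataAndProtocolEntries.
    public record Found(Optional<String> metadataLine, Optional<String> protocolLine, int linesRead) {}

    // Reads the log one line at a time and stops as soon as both entries are seen,
    // so the remaining (possibly very large) part of the file is never read.
    public static Found scan(Reader source)
            throws IOException
    {
        String metadata = null;
        String protocol = null;
        int linesRead = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                linesRead++;
                // Assumption: each action is a compact JSON object on its own line.
                // Real code would parse the line instead of matching a prefix.
                if (line.startsWith("{\"metaData\"")) {
                    metadata = line;
                }
                else if (line.startsWith("{\"protocol\"")) {
                    protocol = line;
                }
                if (metadata != null && protocol != null) {
                    break; // early exit: skip the rest of the file
                }
            }
        }
        return new Found(Optional.ofNullable(metadata), Optional.ofNullable(protocol), linesRead);
    }

    public static void main(String[] args)
            throws IOException
    {
        String log = """
                {"commitInfo":{"operation":"CLONE"}}
                {"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
                {"metaData":{"id":"example-table"}}
                {"add":{"path":"part-0000.parquet"}}
                {"add":{"path":"part-0001.parquet"}}
                """;
        Found found = scan(new StringReader(log));
        // Stops after 3 of the 5 lines; the two "add" entries are never read.
        System.out.println("lines read: " + found.linesRead());
    }
}
```

In a log with millions of `add` entries after the metadata and protocol entries, a scan like this touches only the first few lines, which is the kind of saving the description refers to.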
Additional context and related issues
On an example transaction log JSON of 1.5 GB, the time taken for simple operations
like register table, DESCRIBE, and SELECTs that don't use table statistics (or any read with
set session delta.statistics_enabled=false) drops from 18s to under 1s on a local machine. Such large transaction log JSON files were observed to have been produced by the CLONE operation from Apache Spark.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: